library(leaflet)
library(rvest)
## Loading required package: xml2
## Registered S3 method overwritten by 'rvest':
## method from
## read_xml.response xml2
library(geojsonio)
##
## Attaching package: 'geojsonio'
## The following object is masked from 'package:base':
##
## pretty
library(tidyverse)
## Registered S3 method overwritten by 'dplyr':
## method from
## print.location geojsonio
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.1 v purrr 0.3.2
## v tibble 2.1.1 v dplyr 0.8.0.1
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag() masks stats::lag()
## x purrr::pluck() masks rvest::pluck()
library(ggrepel)
library(corrplot)
## corrplot 0.84 loaded
library(jsonlite)
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
## The following object is masked from 'package:geojsonio':
##
## validate
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
If you’re happy and you know it. claps
main.data <- read.csv("gdp.countries.csv")
corrplot.data <- main.data %>%
filter(is.na(gdp.2017) == FALSE) %>%
select(Economy..GDP.per.Capita., Family, Health..Life.Expectancy., Freedom, Generosity, Trust..Government.Corruption.,
Dystopia.Residual, gdp.2017, Happiness.Score)
colnames(corrplot.data) <- c("Economy", "Family", "Life Expectancy", "Freedom", "Generosity",
"Trust in Government", "Dystopia Residual", "GDP/Capita (2017)", "Happiness Score")
corrplot(cor(corrplot.data), tl.col = "black")
Keystone graphic:
The relationship between the different variables contributing to a nation’s happiness is shown above. The key ones are:
1. Economy
2. Family
3. Life Expectancy
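To read the same information off as numbers rather than circle sizes, one could sort the Happiness Score column of the correlation matrix. The sketch below is only an illustration: `toy` and its values are made up and stand in for the report's `corrplot.data`.

```r
# Toy data with made-up values; the real analysis uses corrplot.data.
toy <- data.frame(
  Economy    = c(0.2, 0.6, 0.9, 1.3, 1.5),
  Family     = c(0.5, 0.9, 1.1, 1.3, 1.5),
  Generosity = c(0.30, 0.10, 0.25, 0.20, 0.35),
  Happiness  = c(3.1, 4.4, 5.2, 6.3, 7.4)
)

# Correlation matrix, the same kind of object that is passed to corrplot().
cor.matrix <- cor(toy)

# Correlation of every other variable with Happiness, strongest first.
happy.cor <- cor.matrix[, "Happiness"]
happy.cor <- happy.cor[names(happy.cor) != "Happiness"]
ranked <- sort(abs(happy.cor), decreasing = TRUE)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top as well, which matches how circle size is read in the plot.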
Introduction:
We wanted to see what factors into the different levels of happiness across countries. While doing our research, we came across the World Happiness Index which assigns a happiness score to each country. This is the weighted average of a few key variables which explain the variance in happiness within that nation. These variables include the GDP per capita, having family or support systems in place, generosity, life expectancy and the level of trust in the government. Interestingly enough, there is also a variable called the dystopia residual which includes the contribution to happiness unexplained by the aforementioned factors. We were not only excited to learn more about which countries were the happiest, but also about how much importance each variable held.
We hypothesize that the countries with the greatest amount of economic growth will be the happiest, since they should have the highest life expectancy, standard of living and comfort levels. It is also important to note that many of these variables are correlated with each other while simultaneously contributing to the happiness of a country. Therefore, we want to run a regression to determine which of these variables have the most impact. We also aim to see if there are any region-specific trends in happiness. For instance, does the East place a greater level of importance on the variable Family?
Methods:
We intend to build individual graphs of our variables with happiness scores to see if there are any anomalies.
We also intend to use regression to determine which variable(s) contribute the most to happiness.
We also want to visually represent the contribution of each of these variables to the happiness score. We have two graphics which do this.
Let’s look at how happy different countries in the world were in 2017 (the latest year for which data was available). This will help us begin the analysis. We used a leaflet map to depict this information.
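# Read the country polygons and set up the base map
countries <- geojson_read("custom.geo.json", what = "sp")
map <- countries %>%
  leaflet() %>%
  addTiles()
# Keep only countries that have a happiness score
map.data <- main.data %>%
  filter(!is.na(Happiness.Score))
# Colour scale with one bin per whole point of happiness score
bins <- c(2, 3, 4, 5, 6, 7, 8, Inf)
colors <- colorBin(palette = "YlOrRd",
                   domain = map.data$Happiness.Score,
                   bins = bins)
# NOTE: fillColor is matched to polygons by position, so this assumes that the
# rows of map.data are in the same order as the polygons in countries
map %>%
  addPolygons(fillColor = ~colors(map.data$Happiness.Score),
              weight = 2,
              opacity = 1,
              color = "white",
              dashArray = "3",
              fillOpacity = 0.7) %>%
  addLegend("bottomleft", pal = colors, values = bins,
            title = "Happiness Score (2017)",
            opacity = 1)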
This map is consistent with our initial hypothesis: the geographical north is happier on average than the geographical south, probably because the north tends to do better economically. The same can be said of the west relative to the east.
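One caveat about the map code: `colors(map.data$Happiness.Score)` is applied to the polygons by position, so it relies on the score rows being in the same order as the polygons. If that is not guaranteed, the scores can be aligned by country name first. The sketch below uses base R's `match()`. The names `polygon.names` and `scores` and all values are hypothetical, since the report does not show which field of `custom.geo.json` holds the country name.

```r
# Hypothetical polygon names, in the order the polygons are stored.
polygon.names <- c("Norway", "Chad", "Atlantis", "Japan")

# Hypothetical scores, stored in a different order and missing one polygon.
scores <- data.frame(
  Country = c("Japan", "Norway", "Chad"),
  Happiness.Score = c(5.9, 7.5, 3.9),
  stringsAsFactors = FALSE
)

# For each polygon, find the row of `scores` with the same name
# (NA when the country has no score).
idx <- match(polygon.names, scores$Country)
aligned.scores <- scores$Happiness.Score[idx]

print(data.frame(polygon = polygon.names, score = aligned.scores))
```

A vector aligned this way has one entry per polygon, so it can be passed to the colour function without depending on row order.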
In the data set, we have a column for “Economy”, which shows the contribution to happiness scores made by the economy in a nation. This measurement was defined by the Happiness Index. We have ourselves added the actual GDP per capita in 2017 to the data set.
Let’s explore the impact of GDP on happiness further.
gggdp <- main.data %>%
  ggplot(aes(x = Economy..GDP.per.Capita., y = Happiness.Score, text = Country)) +
  # Refer to the column by name inside aes() rather than through main.data$
  geom_point(aes(colour = continent, size = gdp.2017)) +
  geom_smooth(method = "lm") +
  theme_bw() +
  labs(color = "Continent", size = "GDP per capita (USD)") +
  ggtitle("The Contribution of GDP per capita to Happiness Score") +
  # Label only the countries where GDP contributes almost nothing
  geom_text(aes(label = ifelse(Economy..GDP.per.Capita. < 0.07, as.character(Country), "")), hjust = 0, vjust = 0) +
  xlab("GDP's contribution") + ylab("Happiness Score")
ggplotly(gggdp, tooltip = c("x", "y", "Country"))
As predicted, there is a positive linear relationship between GDP’s contribution and Happiness Score. The happiest nations tend to be the wealthiest ones (many of them European), and many of the poorer countries (mostly African) tend to be the least happy. This makes sense because money promotes growth, development and a higher standard of living, all of which contribute to higher levels of happiness. The effect is easier to see with the size of each dot scaled to the country’s actual GDP per capita. However, clearly not all of this variation can be explained by GDP: for some of the countries with the largest GDP per capita (the blue European cluster at the top), GDP contributes less to the overall happiness score than it does for their Asian counterparts, even though the latter have smaller GDP values.
The Central African Republic and Somalia stand out here. The Central African Republic does not come as a surprise: its political unrest means that its GDP, the GDP’s contribution to happiness, and its happiness score are all low. Somalia is an interesting outlier because it has a happiness score roughly in the middle of all the countries, yet GDP (which is also very small) barely contributes to its happiness. This means that the impact of the other variables is a lot more pertinent in this case. We assume that this happiness comes from family.
Let’s look at family’s contribution to happiness in different continents.
Family is calculated by asking survey respondents whether they have friends or family that they could count on if needed. A yes is coded as 1 and a no as 0, and the average of all values for a nation is the Family factor.
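As a small worked example of the averaging described above, the mean of 0/1 answers is simply the share of respondents who answered yes. The answers below are hypothetical and are not taken from the survey.

```r
# Hypothetical yes/no answers from ten respondents in one country:
# 1 = has someone to count on, 0 = does not.
answers <- c(1, 1, 0, 1, 1, 1, 0, 1, 1, 1)

# The national average is the share of respondents who answered yes.
share.yes <- mean(answers)
print(share.yes)
```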
ggfam <- main.data %>%
  # text = Country makes the country name available to the plotly tooltip
  ggplot(aes(x = Family, y = Happiness.Score, text = Country)) +
  geom_point(aes(colour = continent)) +
  theme_bw() +
  ggtitle("The Contribution of Family to Happiness Score") +
  xlab("Family's contribution") + ylab("Happiness Score") +
  labs(color = "Continent") +
  # Label the countries with very low family support, plus Somalia
  geom_text(aes(label = ifelse(Family < 0.25, as.character(Country), "")), hjust = 0, vjust = 0) +
  geom_text(aes(label = ifelse(Country == "Somalia", as.character(Country), "")), hjust = 0, vjust = 0)
ggplotly(ggfam, tooltip = c("x", "y", "Country"))
The relationship between Family’s contribution and happiness score is positive. It could also be read as exponential, but in line with our keystone graphic we have chosen to interpret it as linear. Either way, it suggests that family is strongly and positively related to the happiness score and contributes a great deal to it. However, our hypothesis that this relationship between family and happiness would be stronger in the eastern continents was incorrect. The Central African Republic is again unsurprisingly an outlier. This is attributable to the high level of unrest in the country, so if there is no familial support available in the country either, where is their happiness coming from? Somalia has a notable contribution of Family to its Happiness Score, as predicted.
This relationship seems to hold in the Americas and Europe as well, and here the contribution to happiness is in fact higher. The question is whether this contribution is higher because the happiness scores of these areas are higher or because they value family and social support more. We hope our regression split by different continents will shed light on this.
Another key variable is Life Expectancy. The life expectancy value for each country is based on the average number of years a healthy child can expect to live in that country, expressed as a factor.
gglife <- main.data %>%
ggplot(aes(x = Health..Life.Expectancy., y = Happiness.Score, text = Country)) +
geom_point(aes(colour = continent)) +
geom_text(aes(label = ifelse(Happiness.Score < 3, as.character(Country),"")), hjust = 0, vjust = 0) +
theme_bw() +
ggtitle("The Contribution of Life Expectancy to Happiness Score") +
xlab("Life Expectancy's contribution") + ylab("Happiness Score")+
labs(color = "Continent") +
geom_smooth(method='lm', se = FALSE)
ggplotly(gglife, tooltip = c("x", "y", "Country"))
There is clearly a positive relationship between life expectancy and happiness, as anticipated. However, this also says something interesting about the African continent (the red chunk): some countries here perform relatively well on the happiness index despite having low life expectancy and hence a low contribution from it. The Central African Republic, as expected, is quite unhappy, and its poor life expectancy does not contribute towards its happiness, so the question is what does.
The next variable of interest is Freedom. Freedom is calculated by asking people if they are satisfied with the amount of freedom they have to make life choices in their countries. The average of these numbers helps determine the factor value of this variable.
gg1 <- main.data %>%
ggplot(aes(x = Freedom, y = Happiness.Score, text = Country)) +
geom_point(aes(colour = continent)) +
geom_text(aes(label = ifelse(Happiness.Score < 3, as.character(Country),"")), hjust = 0, vjust = 0) +
geom_text(aes(label = ifelse(Country == "Rwanda", as.character(Country),"")), hjust = 0, vjust = 0) +
geom_text(aes(label = ifelse(Country == "Cambodia", as.character(Country),"")), hjust = 0, vjust = 0) +
theme_bw() +
ggtitle("The Contribution of Freedom to Happiness Score") +
xlab("Freedom's contribution") + ylab("Happiness Score")+
labs(color = "Continent")
ggplotly(gg1, tooltip = c("x", "y", "Country"))
This also shows that there is a positive relationship between freedom’s contribution and the happiness score. We finally know where some of the happiness in the Central African Republic is coming from: freedom. Burundi also appears as an outlier in a lot of these graphs. We believe its behaviour is quite similar to that of the Central African Republic, which is why we are not commenting on it separately. However, interestingly enough, Rwanda and Cambodia also stand out here. They have a relatively low level of happiness despite having a lot of freedom, which corroborates our finding that freedom is not as important as other variables, such as the economy, for the happiness score.
Generosity: The next variable we are going to look at is generosity. This is calculated by asking people if they have donated to charity in the past month. We believe that this is a biased measure of generosity because it implies that the western world, with more resources, is on average more generous than the east. This is obviously not true and does not mean that people in the East are stingy. However, let’s still look at this variable’s contribution to happiness.
ggg <- main.data %>%
ggplot(aes(x = Generosity, y = Happiness.Score, text = Country)) +
geom_point(aes(colour = continent)) +
geom_text(aes(label = ifelse(Happiness.Score < 3, as.character(Country),"")), hjust = 0, vjust = 0) +
geom_text(aes(label = ifelse(Generosity > 0.6, as.character(Country),"")), hjust = 0, vjust = 0) +
theme_bw() +
ggtitle("The Contribution of Generosity to Happiness Score") +
xlab("Generosity's contribution") + ylab("Happiness Score")+
labs(color = "Continent")
ggplotly(ggg, tooltip = c("x", "y", "Country"))
Despite the way generosity is defined, the two most generous countries are Myanmar and Indonesia. However, they are not as happy as we would expect them to be, which suggests that Generosity does not have much of an impact on the Happiness Score. This is further corroborated by the lack of a linear relationship for this variable, in contrast to all the others. One might expect more giving to result in happier nations; perhaps we do not see this because generosity was poorly measured, as we alluded to before.
Government Trust:
Government Trust was measured by asking people whether corruption was rampant in the government and the businesses of their countries, and the average of these answers was used to calculate the factor. For this metric, we hypothesize that poorer countries will have a lower level of trust in their government and consequently lower happiness. Let’s take a look.
ggcorrupt <- main.data %>%
  # text = Country is set here, as in the other plots, so that geom_point()
  # does not warn about an unknown aesthetic
  ggplot(aes(x = Trust..Government.Corruption., y = Happiness.Score, text = Country)) +
  geom_point(aes(colour = continent)) +
  geom_text(aes(label = ifelse(Trust..Government.Corruption. > 0.41, as.character(Country), "")), hjust = 0, vjust = 0) +
  #geom_text_repel(aes(label = ifelse(Economy..GDP.per.Capita. > 0.5, as.character(Country),"")), hjust = 0, vjust = 0) +
  #geom_text_repel(aes(label = ifelse(is.na(sub_region) == TRUE, as.character(Country),"")), hjust = 0, vjust = 0) +
  theme_bw() +
  ggtitle("The Contribution of Government Trust to Happiness Score") +
  xlab("Government Trust's contribution") + ylab("Happiness Score") +
  labs(color = "Continent")
ggplotly(ggcorrupt, tooltip = c("x", "y", "Country"))
This shows a positive relationship between the two variables. However, Rwanda clearly stands out, because it has a high level of trust in its government but simultaneously a very low happiness score. This contrasts with Singapore and Qatar, where the high level of trust in the government is matched by a high happiness score. Moreover, there are a lot of countries (concentrated to the left of the graph) which have a high level of happiness but not a lot of trust in their government, suggesting that this metric is not as closely related to happiness score as the other variables explored earlier.
Dystopia Residual:
This variable measures the unexplained variation in happiness that is not factored in through the other variables.
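A minimal sketch of what "unexplained" means here, assuming the Happiness Score is the sum of the six named components plus this residual (the near-unit regression coefficients later in this report are consistent with that assumption). All numbers are hypothetical and are not taken from the data set.

```r
# Hypothetical component values for one country.
components <- c(
  Economy = 1.40, Family = 1.50, Life.Expectancy = 0.80,
  Freedom = 0.60, Generosity = 0.35, Trust = 0.30
)
happiness.score <- 7.10  # hypothetical total score

# Whatever the six named components do not account for is left over.
dystopia.residual <- happiness.score - sum(components)
print(dystopia.residual)
```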
ggdystopia <- main.data %>%
ggplot(aes(x = Dystopia.Residual, y = Happiness.Score, text = Country)) +
geom_point(aes(colour = continent)) +
theme_bw() +
ggtitle("The Contribution of Dystopia Residual to Happiness Score") +
xlab("Dystopia Residual") + ylab("Happiness Score")+
labs(color = "Continent")
ggplotly(ggdystopia, tooltip = c("x", "y", "Country"))
We cannot really make a comment on this relationship besides the fact that it is difficult to fully capture happiness using numeric variables.
Here is a bar chart showing the breakdown of happiness by all these variables, including the unexplained dystopia variable, for all countries.
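# Reshape to long format: one row per country and contributing variable
# (Freedom is included so that the segments cover every component of the score)
experiment <- main.data %>%
  gather(key = "Happiness Contributors",
         value = "Measurement",
         Family, Health..Life.Expectancy., Freedom, Generosity, Economy..GDP.per.Capita.,
         Trust..Government.Corruption., Dystopia.Residual)
# Stack each variable's contribution (Measurement), not the total score,
# so that a bar's full length is the country's happiness score
ggexperiment <- experiment %>%
  filter(!is.na(Happiness.Score)) %>%
  ggplot(aes(x = Country, y = Measurement, fill = `Happiness Contributors`, text = Measurement)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_brewer(palette = "Accent") +
  ggtitle("The Distribution of Happiness Score across Countries") +
  # coord_flip() puts Country on the vertical axis and the score on the horizontal one
  xlab("Country") + ylab("Happiness Score") +
  labs(fill = "Happiness Contributors") +
  coord_flip()
ggplotly(ggexperiment, tooltip = c("`Happiness Contributors`", "Measurement"))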
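To complement the stacked bar chart, the average contribution of each variable across countries can be summarised in a few lines. This is only a sketch on a made-up data frame (`toy`, hypothetical values), not the report's data.

```r
# Hypothetical contributions for three countries (made-up numbers).
toy <- data.frame(
  Country    = c("A", "B", "C"),
  Economy    = c(1.4, 0.9, 0.3),
  Family     = c(1.5, 1.2, 0.6),
  Generosity = c(0.3, 0.2, 0.4)
)

# Average contribution of each variable across countries, largest first.
avg.contribution <- sort(colMeans(toy[, -1]), decreasing = TRUE)
print(avg.contribution)
```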
Clearly, some variables such as Economy and Family contribute much more to Happiness Score than others such as Generosity. To learn more about which variables contribute the most, we built a linear regression model.
# Let's look at the variables and find the most important factors for predicting happiness
corrplot.data <- main.data %>%
  filter(!is.na(gdp.2017)) %>%
  select(Economy..GDP.per.Capita., Family, Health..Life.Expectancy., Freedom, Generosity,
         Trust..Government.Corruption., Dystopia.Residual, gdp.2017, Happiness.Score)
# Linear regression model using every other column as a predictor
fun.model <- lm(Happiness.Score ~ .,
                data = corrplot.data)
# Inspect the full model (step() is run afterwards to find the best variables)
summary(fun.model)
##
## Call:
## lm(formula = Happiness.Score ~ ., data = corrplot.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.547e-04 -2.376e-04 -5.180e-06 2.332e-04 4.913e-04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.769e-04 1.621e-04 1.708 0.09 .
## Economy..GDP.per.Capita. 1.000e+00 1.408e-04 7104.780 <2e-16 ***
## Family 9.998e-01 1.287e-04 7770.763 <2e-16 ***
## Health..Life.Expectancy. 9.998e-01 1.914e-04 5225.018 <2e-16 ***
## Freedom 1.000e+00 2.056e-04 4864.793 <2e-16 ***
## Generosity 1.000e+00 2.053e-04 4872.257 <2e-16 ***
## Trust..Government.Corruption. 9.995e-01 3.243e-04 3082.241 <2e-16 ***
## Dystopia.Residual 1.000e+00 4.836e-05 20678.322 <2e-16 ***
## gdp.2017 3.257e-09 2.305e-09 1.413 0.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0002785 on 128 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.891e+08 on 8 and 128 DF, p-value: < 2.2e-16
step(fun.model)
## Start: AIC=-2234.31
## Happiness.Score ~ Economy..GDP.per.Capita. + Family + Health..Life.Expectancy. +
## Freedom + Generosity + Trust..Government.Corruption. + Dystopia.Residual +
## gdp.2017
##
## Df Sum of Sq RSS AIC
## <none> 0.000 -2234.31
## - gdp.2017 1 0.000 0.000 -2234.19
## - Trust..Government.Corruption. 1 0.737 0.737 -699.88
## - Freedom 1 1.835 1.835 -574.84
## - Generosity 1 1.841 1.841 -574.42
## - Health..Life.Expectancy. 1 2.117 2.117 -555.27
## - Economy..GDP.per.Capita. 1 3.915 3.915 -471.06
## - Family 1 4.683 4.683 -446.51
## - Dystopia.Residual 1 33.162 33.162 -178.34
##
## Call:
## lm(formula = Happiness.Score ~ Economy..GDP.per.Capita. + Family +
## Health..Life.Expectancy. + Freedom + Generosity + Trust..Government.Corruption. +
## Dystopia.Residual + gdp.2017, data = corrplot.data)
##
## Coefficients:
## (Intercept) Economy..GDP.per.Capita.
## 2.769e-04 1.000e+00
## Family Health..Life.Expectancy.
## 9.998e-01 9.998e-01
## Freedom Generosity
## 1.000e+00 1.000e+00
## Trust..Government.Corruption. Dystopia.Residual
## 9.995e-01 1.000e+00
## gdp.2017
## 3.257e-09
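The near-unit coefficients and the R-squared of 1 in the output above are what one would expect if the response is, up to rounding, the sum of its predictors. The sketch below reproduces that pattern on made-up data; `x1`, `x2`, `x3` and `score` are hypothetical names and values.

```r
set.seed(1)  # reproducible made-up data

# Three hypothetical components and a score that is their sum,
# rounded to three decimals as published figures usually are.
n <- 50
toy <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
toy$score <- round(toy$x1 + toy$x2 + toy$x3, 3)

# Regressing the sum on its own parts gives slopes of about 1.
toy.model <- lm(score ~ ., data = toy)
print(round(coef(toy.model), 3))
print(summary(toy.model)$r.squared)
```

In such a fit the coefficients say little about relative importance, which is why the change in residual sum of squares reported by `step()` is the more informative part of the output.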
Results:
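As our results show, the variables Economy (GDP per capita), Family and Life Expectancy were the most important named contributors to the Happiness Score of countries. The fitted coefficients are all approximately 1 and the R-squared is 1, which indicates that the Happiness Score is essentially the sum of these components, so the coefficients alone cannot rank them. We therefore ran the step function to see the ideal set of variables for this regression: among the named factors, dropping Family, Economy or Life Expectancy increases the residual sum of squares the most (4.683, 3.915 and 2.117 respectively), behind only the Dystopia Residual (33.162), while the raw 2017 GDP per capita adds almost nothing. The corrplot shown below further corroborates this result.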
corrplot.data <- main.data %>%
filter(is.na(gdp.2017) == FALSE) %>%
select(Economy..GDP.per.Capita., Family, Health..Life.Expectancy., Freedom, Generosity, Trust..Government.Corruption.,
Dystopia.Residual, gdp.2017, Happiness.Score)
colnames(corrplot.data) <- c("Economy", "Family", "Life Expectancy", "Freedom", "Generosity",
"Trust in Government", "Dystopia Residual", "GDP/Capita (2017)", "Happiness Score")
corrplot(cor(corrplot.data), tl.col = "black")
In the corrplot above, we see that the biggest correlations (observed through the size of the circle) with respect to Happiness Scores are seen with the variables Economy, Family, and Life Expectancy. We are taking GDP per capita and Economy to be equivalent for reasons mentioned above.
Conclusion:
We learned that Economy, Life Expectancy, and Family were the most important variables that contributed to the happiness of nations. We were curious to see why Family was an important indicator of happiness while the other survey-based variables were not, so we looked deeper into how these metrics were measured. Our research showed that the survey interviewed 1000 respondents rather than a fixed percentage of the population. This means that the sample covers a much smaller share of the population in more populous countries, which may make the survey-based estimates less reliable there. This would be consistent with our results, as Economy and Life Expectancy are absolute values that are free of the small-sample bias (1000 respondents) that the other variables may suffer from. In the future, to strengthen the reliability of these estimates, perhaps a fixed percentage of each country’s population should be surveyed instead of a constant 1000 respondents to account for population differences. However, we recognize that this poses an administrative and economic burden, which could be eased by analyzing these trends over time and across different samples.
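For a rough sense of the sampling error that comes with 1000 respondents, the textbook standard error of a proportion can be computed directly. This is a sketch that assumes simple random sampling, which this report does not verify for the survey; `proportion.se` is a hypothetical helper name.

```r
# Standard error of a sample proportion under simple random sampling.
# p: true share answering "yes"; n: number of respondents.
proportion.se <- function(p, n) sqrt(p * (1 - p) / n)

# Worst case (p = 0.5) for the 1000 respondents mentioned above.
se.1000 <- proportion.se(0.5, 1000)

# Approximate 95% margin of error.
moe.1000 <- 1.96 * se.1000
print(c(se = se.1000, margin = moe.1000))
```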